Machine Learning for Data Linkage

Abstract

Data linkage traditionally uses deterministic and probabilistic methods. Alternatively, machine learning methods can be applied as classification algorithms, using the data to inform linkage decisions. This project compared linkage quality, in terms of precision and recall, of traditional methods with that of selected machine learning methods when applied to a standard linkage problem.
Two supervised methods, gradient boosted trees (GBT) and multiple layered perceptron classifier (MLPC), and one unsupervised method, maximum entropy classifier (MEC), were implemented. The linkage of the England and Wales 2021 Census to the Census Coverage Survey (CCS) was used as a gold-standard (GS) linked dataset to provide training samples for the supervised methods, as well as for testing all methods. The F1 score (the harmonic mean of precision and recall) was used to compare the performance of the models and to determine optimal parameters and thresholds.
The Splink implementation of Fellegi-Sunter with Expectation Maximisation was used as a baseline for comparison.
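Machine learning models, trained on a sample of the GS, were used to link the census and CCS data. All methods performed well, with MEC achieving the highest precision (99.79%) but the lowest recall (96.36%). The MLPC model achieved the highest F1 score (98.94%).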
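To understand the implications of not retraining the models for each dataset, the models were also applied to a health dataset. The models were not retrained on the health data; instead, the models optimised on the GS were applied. Splink had the highest precision (96.51%), recall (98.48%) and F1 score (97.49%). With F1 scores of 96.99% and 96.14% respectively, GBT and MLPC were not far behind the performance of Splink, despite not being trained on the health data.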
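We have shown that machine learning methods can be used effectively for data linkage problems. Unsurprisingly, machine learning models perform best when applied to the same data they were trained on. Further research into generic models may allow us to use both traditional and machine learning methods in future linkage.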
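
The abstract uses the F1 score both to compare models and to choose thresholds. The following is a minimal Python sketch of that idea, not the authors' code: the function names, the toy match scores and the candidate thresholds are all hypothetical. As a sanity check, if the health-dataset figures 96.51% and 98.48% are read as precision and recall (an assumption), their harmonic mean rounds to the third reported figure, 97.49%.

```python
from typing import Optional, Sequence, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision and recall from counts of true/false positive links and missed links."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def best_threshold(
    scores: Sequence[float],
    is_match: Sequence[bool],
    thresholds: Sequence[float],
) -> Tuple[Optional[float], float]:
    """Return the (threshold, F1) pair with the highest F1 on labelled candidate pairs."""
    best_t, best_f1 = None, -1.0
    for t in thresholds:
        # A candidate pair is linked when its score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, is_match) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, is_match) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, is_match) if s < t and y)
        f1 = f1_score(*precision_recall(tp, fp, fn))
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical classifier scores for six candidate pairs, with gold-standard labels.
toy_scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.20]
toy_labels = [True, True, False, True, False, False]
chosen_threshold, chosen_f1 = best_threshold(toy_scores, toy_labels, [0.3, 0.5, 0.7, 0.85])

# Consistency check on the figures quoted in the abstract (labels assumed).
health_f1 = f1_score(0.9651, 0.9848)
```

Because F1 weights precision and recall equally, a threshold chosen this way trades missed links against false links symmetrically; a linkage project with asymmetric error costs might prefer a different criterion.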

Similar resources

Machine Learning Models for Housing Prices Forecasting using Registration Data

This article has been compiled to identify the best model of housing price forecasting using machine learning methods with maximum accuracy and minimum error. Five important machine learning algorithms are used to predict housing prices, including Nearest Neighbor Regression Algorithm (KNNR), Support Vector Regression Algorithm (SVR), Random Forest Regression Algorithm (RFR), Extreme Gradient B...

Machine Learning, Information Retrieval, and Record Linkage

Classification into groups using terms available in the data underlies machine learning, information retrieval, and record linkage. Classifiers such as Bayesian networks in machine learning and term weighting in information retrieval depend primarily on training data sets for which truth is known. These classifiers may be relatively slow to adapt to new situations in which new data have charact...

Improving the Performance of Machine Learning Algorithms for Heart Disease Diagnosis by Optimizing Data and Features

The heart is one of the most important organs of the body, and heart disease is the major cause of death in the world and in Iran. This is why early and timely diagnosis is one of the most significant foundations for preventing and reducing deaths from this disease. So far, many studies have been done on heart disease with the aim of prediction, diagnosis, and treatment. However, most of them have been mostly f...

Machine Learning for Sequential Data: A Review

Statistical learning problems in many fields involve sequential data. This paper formalizes the principal learning tasks and describes the methods that have been developed within the machine learning research community for addressing these problems. These methods include sliding window methods, recurrent sliding windows, hidden Markov models, conditional random fields, and graph transformer net...

Journal

Journal title: International Journal for Population Data Science

Year: 2023

ISSN: 2399-4908

DOI: https://doi.org/10.23889/ijpds.v8i2.2240